In this unit we discuss several (adaptive) MCMC strategies. We will implement these proposals for a logistic regression model using the “famous” Pima Indian dataset. The purpose of this unit is mainly to present the code of the implementations, not to benchmark the performance of the various MCMC algorithms; refer to the nice paper by Chopin & Ridgway (2017) for a more comprehensive discussion of this aspect.
First, let us load the Pima.tr and Pima.te datasets, which are available in the MASS R package. As described in the documentation, this dataset comprises:
A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.
See also the documentation of the Pima.tr and Pima.te datasets for a more comprehensive description of the involved variables. We combine the training and test dataset into a single data.frame as follows:
Pima <- rbind(MASS::Pima.tr, MASS::Pima.te)

We are interested in modeling the probability of being diabetic (variable type) as a function of the covariates.
y <- as.numeric(Pima$type == "Yes") # Binary response
X <- model.matrix(type ~ . - 1, data = Pima) # Design matrix
X <- cbind(1, scale(X)) # Standardize the data

Let \(\textbf{y} = (y_1,\dots,y_n)^\intercal\) be the vector of observed binary responses (variable type) and let \(\textbf{X}\) be the corresponding design matrix, whose generic row is \(\textbf{x}_i = (x_{i1},\dots,x_{ip})^\intercal\), for \(i=1,\dots,n\). We employ a generalized linear model such that
\[
y_i \mid \pi_i \overset{\text{ind}}{\sim} \text{Bern}(\pi_i), \qquad \pi_i = g(\eta_i), \qquad \eta_i = \beta_1x_{i1} + \cdots + \beta_p x_{ip},
\] where \(g(\cdot)\) is the inverse link function, either the inverse logit transform (plogis function) or the cumulative distribution function of a standard normal (pnorm function). We wish to estimate the parameter vector \(\beta = (\beta_1,\dots,\beta_p)^\intercal\). To fix ideas, we are interested in the Bayesian counterpart of the following models for binary data:
# Probit model
fit_probit <- glm(type ~ X - 1, family = binomial(link = "probit"), data = Pima)
# Logit model
fit_logit <- glm(type ~ X - 1, family = binomial(link = "logit"), data = Pima)

The focus of this presentation is on the computational aspects, so we will employ a relatively vague prior centered at \(0\), namely \(\beta \sim N(0, 100 I_p)\). The reader should be aware that much better alternatives exist, for example if one aims at specifying an “uninformative” prior.
First, we specify the log-likelihood and the log-posterior functions. We consider here the logistic case.
# Loglikelihood of a logistic regression model
loglik <- function(beta, y, X) {
eta <- c(X %*% beta)
sum(y * eta - log(1 + exp(eta)))
}
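As a side note, for large values of \(|\eta_i|\) the term exp(eta) in the log-likelihood above can overflow to Inf, yielding -Inf even when the log-likelihood is finite. This is not a problem for the standardized Pima covariates, but a numerically stable variant (a sketch; loglik_stable is a name of our choosing) can be written with log1p:

```r
# Numerically stable evaluation of sum(y * eta - log(1 + exp(eta))):
# for eta > 0 we use log(1 + exp(eta)) = eta + log1p(exp(-eta)),
# so that exp() is only ever applied to non-positive arguments.
loglik_stable <- function(beta, y, X) {
  eta <- c(X %*% beta)
  log1pexp <- ifelse(eta > 0, eta + log1p(exp(-eta)), log1p(exp(eta)))
  sum(y * eta - log1pexp)
}
```

For the moderate linear predictors arising in this unit the two versions coincide, so we keep the simpler one in what follows.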
# Logposterior
logpost <- function(beta, y, X) {
loglik(beta, y, X) + sum(dnorm(beta, 0, 10, log = TRUE))
}

# R represents the number of samples
# burn_in is the number of discarded samples
# S is the covariance matrix of the multivariate Gaussian proposal
RMH <- function(R, burn_in, y, X, S) {
p <- ncol(X)
out <- matrix(0, R, p) # Initialize an empty matrix to store the values
beta <- rep(0, p) # Initial values
# Eigen-decomposition
eig <- eigen(S, symmetric = TRUE)
A1 <- t(eig$vectors) * sqrt(eig$values) # A1 = Lambda^(1/2) V^T, so z %*% A1 is N(0, S) for z ~ N(0, I_p)
# Starting the Metropolis-Hastings loop
for (r in 1:(burn_in + R)) {
beta_new <- beta + c(matrix(rnorm(p), 1, p) %*% A1)
alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
if (runif(1) < alpha) {
beta <- beta_new # Accept the value
}
# Store the values after the burn-in period
if (r > burn_in) {
out[r - burn_in, ] <- beta
}
}
out
}

library(coda)
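The eigen-decomposition trick used inside RMH can be checked numerically: with A1 = Lambda^(1/2) V^T we have crossprod(A1) = V Lambda V^T = S, so z %*% A1 with z ~ N(0, I_p) has covariance S. A quick sanity check on an arbitrary positive definite matrix:

```r
set.seed(42)
p <- 3
M <- matrix(rnorm(p * p), p, p)
S <- crossprod(M) + diag(p) # An arbitrary positive definite "covariance" matrix
eig <- eigen(S, symmetric = TRUE)
A1 <- t(eig$vectors) * sqrt(eig$values) # Column-major recycling scales row i by sqrt(lambda_i)
# t(A1) %*% A1 = V Lambda V^T recovers S up to rounding error
stopifnot(max(abs(crossprod(A1) - S)) < 1e-8)
```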
R <- 30000
burn_in <- 30000

set.seed(123)
S <- diag(1e-3, ncol(X))
# Running the MCMC
fit_MCMC <- as.mcmc(RMH(R, burn_in, y, X, S)) # Convert the matrix into a "coda" object
# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
var1 var2 var3 var4 var5 var6 var7 var8
318.2368 207.2123 300.9986 328.1145 198.1866 215.9027 333.1073 174.9089
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
var1 var2 var3 var4 var5 var6 var7 var8
94.26943 144.77904 99.66825 91.43151 151.37246 138.95146 90.06107 171.51783
1 - rejectionRate(fit_MCMC) # Acceptance rate
var1 var2 var3 var4 var5 var6 var7 var8
0.719124 0.719124 0.719124 0.719124 0.719124 0.719124 0.719124 0.719124
# Traceplot of the intercept
plot(fit_MCMC[, 1:2])

A better proposal can be obtained by scaling the estimated covariance matrix of the maximum likelihood estimator by the factor \(2.38^2/p\), which is asymptotically optimal for Gaussian targets:

set.seed(123)
p <- ncol(X)
S <- 2.38^2 * vcov(fit_logit) / p
# Running the MCMC
fit_MCMC <- as.mcmc(RMH(R, burn_in, y, X, S)) # Convert the matrix into a "coda" object
# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
var1 var2 var3 var4 var5 var6 var7 var8
1193.711 1106.886 1224.056 1239.297 1184.106 1143.590 1219.163 1244.564
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
var1 var2 var3 var4 var5 var6 var7 var8
25.13171 27.10306 24.50867 24.20728 25.33557 26.23318 24.60704 24.10482
1 - rejectionRate(fit_MCMC) # Acceptance rate
var1 var2 var3 var4 var5 var6 var7 var8
0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 0.2745758
# Traceplot of the intercept
plot(fit_MCMC[, 1:2])

We consider here a version of the Adaptive Metropolis (AM) algorithm of Haario, Saksman, and Tamminen (2001). The algorithm proceeds as in the standard Metropolis algorithm, but the covariance matrix of the proposal is updated at each iteration.
We use the following (adaptive) proposal distribution:
\[q_r({\bf \beta}^* \mid {\bf \beta}) \sim N({\bf \beta}, \:2.38^2 / p \: \Sigma_r + \epsilon I_p),\]
where \(\Sigma_r\) is the covariance matrix of the first \(r\) sampled values \(\beta^{(1)},\dots,\beta^{(r)}\) and \(\epsilon > 0\) is some small value that avoids degeneracies (i.e. the resulting covariance matrix must be invertible). Here, we will use \(\epsilon = 10^{-6}\). Moreover, the following recursive formula holds: \[ \Sigma_r = \frac{1}{r}\sum_{j=1}^r(\beta^{(j)} - \bar{\beta}^{(r)})(\beta^{(j)} - \bar{\beta}^{(r)})^\intercal = \frac{r - 1}{r}\Sigma_{r-1} + \frac{r - 1}{r^2}(\beta^{(r)} - \bar{\beta}^{(r-1)})(\beta^{(r)} - \bar{\beta}^{(r-1)})^\intercal, \]
where \(\bar{\beta}^{(r)} = (r - 1)/r\:\bar{\beta}^{(r-1)} + \beta^{(r)} / r\) is the arithmetic mean of the first \(r\) sampled values. This facilitates the computation of \(\Sigma_r\); in the implementation below we use the slightly simpler factor \(1/r\) in place of \((r-1)/r^2\), which is asymptotically equivalent. The code for this AM algorithm is given in the following chunk.
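As a quick sanity check, the recursive update can be compared with the batch covariance computed by cov; the sketch below uses simulated Gaussian draws in place of chain values and the exact factor \((r-1)/r^2\):

```r
set.seed(42)
n <- 200; p <- 3
B <- matrix(rnorm(n * p), n, p) # Simulated "chain", one sampled value per row
Sigma_r <- matrix(0, p, p)
mu_r <- rep(0, p)
for (r in 1:n) {
  d <- B[r, ] - mu_r # Deviation from the mean of the first r - 1 values
  Sigma_r <- (r - 1) / r * Sigma_r + (r - 1) / r^2 * tcrossprod(d)
  mu_r <- (r - 1) / r * mu_r + B[r, ] / r # Recursive update of the mean
}
# Sigma_r matches the batch covariance with 1/n (instead of 1/(n - 1)) normalization
stopifnot(max(abs(Sigma_r - (n - 1) / n * cov(B))) < 1e-8)
```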
# R represent the number of samples
# burn_in is the number of discarded samples
# S is the covariance matrix of the multivariate Gaussian proposal
RMH_Adaptive <- function(R, burn_in, y, X) {
p <- ncol(X)
out <- matrix(0, R, p) # Initialize an empty matrix to store the values
vars <- matrix(0, R + burn_in, p) # Just for teaching purposes - store the variances
beta <- rep(0, p) # Initial values
epsilon <- 1e-6
# Initial matrix S
S <- diag(epsilon, p)
Sigma_r <- diag(0, p)
mu_r <- beta
for (r in 1:(burn_in + R)) {
# Updating the covariance matrix
Sigma_r <- (r - 1) / r * Sigma_r + tcrossprod(beta - mu_r) / r # Recursive update (1 / r approximates the exact (r - 1) / r^2 factor)
mu_r <- (r - 1) / r * mu_r + beta / r
S <- 2.38^2 * Sigma_r / p + diag(epsilon, p)
# Eigen-decomposition
eig <- eigen(S, symmetric = TRUE)
A1 <- t(eig$vectors) * sqrt(eig$values)
beta_new <- beta + c(matrix(rnorm(p), 1, p) %*% A1)
alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
if (runif(1) < alpha) {
beta <- beta_new # Accept the value
}
vars[r, ] <- diag(Sigma_r) # For teaching purposes, let us store the variances
# Store the values after the burn-in period
if (r > burn_in) {
out[r - burn_in, ] <- beta
}
}
list(beta = out, vars = vars)
}

set.seed(123)
# Running the MCMC
fit_MCMC <- RMH_Adaptive(R = R, burn_in = burn_in, y, X)
beta_MCMC <- as.mcmc(fit_MCMC$beta) # Convert the matrix into a "coda" object
# Diagnostic
effectiveSize(beta_MCMC) # Effective sample size
var1 var2 var3 var4 var5 var6 var7 var8
1284.6719 1036.3597 1184.7249 1039.2479 869.2933 985.3165 1008.0534 891.9833
R / effectiveSize(beta_MCMC) # Integrated autocorrelation time
var1 var2 var3 var4 var5 var6 var7 var8
23.35227 28.94748 25.32233 28.86703 34.51079 30.44707 29.76033 33.63292
1 - rejectionRate(beta_MCMC) # Acceptance rate
var1 var2 var3 var4 var5 var6 var7 var8
0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 0.2057735
# Traceplot of the intercept
plot(beta_MCMC[, 1:2])
# Traceplot of the variance of some variables
par(mfrow = c(2, 1))
plot(fit_MCMC$vars[, 1], type = "l", main = "Variance of variable 1", ylab = "Variance of variable 1")
plot(fit_MCMC$vars[, 2], type = "l", main = "Variance of variable 2", ylab = "Variance of variable 2")

We now consider a componentwise ("Metropolis-within-Gibbs") random-walk strategy, in which the coordinates of \(\beta\) are updated one at a time using univariate Gaussian proposals with standard deviations se.

RMH_Gibbs <- function(R, burn_in, y, X, se) {
p <- ncol(X)
out <- matrix(0, R, p) # Initialize an empty matrix to store the values
beta <- beta_new <- rep(0, p) # Initial values
for (r in 1:(burn_in + R)) {
for (j in 1:p) {
beta_new[j] <- beta[j] + rnorm(1, 0, se[j])
alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
if (runif(1) < alpha) {
beta[j] <- beta_new[j] # Accept the value
} else {
beta_new[j] <- beta[j] # Restore the current value after a rejection
}
}
# Store the values after the burn-in period
if (r > burn_in) {
out[r - burn_in, ] <- beta
}
}
out
}

p <- ncol(X)
se <- sqrt(rep(1e-4, p))
set.seed(123)
# Running the MCMC
fit_MCMC <- as.mcmc(RMH_Gibbs(R = R, burn_in = burn_in, y, X, se)) # Convert the matrix into a "coda" object
# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
var1 var2 var3 var4 var5 var6 var7 var8
45.95430 41.38699 40.84355 46.78707 37.52987 28.46185 48.15595 41.35603
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
var1 var2 var3 var4 var5 var6 var7 var8
652.8225 724.8655 734.5102 641.2028 799.3633 1054.0427 622.9759 725.4081
1 - rejectionRate(fit_MCMC) # Acceptance rate
var1 var2 var3 var4 var5 var6 var7 var8
0.9543651 0.9538651 0.9536985 0.9538985 0.9526318 0.9530984 0.9556985 0.9540318
# Traceplot of the intercept
plot(fit_MCMC[, 1:2])

Finally, we consider an adaptive version of the componentwise strategy: the log standard deviation \(\epsilon_j\) of each univariate proposal is adjusted every 50 iterations, increased when the empirical acceptance rate of the batch exceeds the target value (0.44 by default) and decreased otherwise.

RMH_Gibbs_Adaptive <- function(R, burn_in, y, X, target = 0.44) {
p <- ncol(X)
out <- matrix(0, R, p) # Initialize an empty matrix to store the values
beta <- beta_new <- rep(0, p) # Initial values
epsilon <- rep(0, p) # Initial log standard deviations of the proposals
accepted <- numeric(p) # Acceptance counters for the current batch
batch <- 1
for (r in 1:(burn_in + R)) {
# Do we need to update the parameters?
if (batch == 50) {
for (j in 1:p) {
# Adaptation step
if ((accepted[j] / 50) > target) {
epsilon[j] <- epsilon[j] + min(0.01, sqrt(1 / r))
} else {
epsilon[j] <- epsilon[j] - min(0.01, sqrt(1 / r))
}
}
# Restart the cycle - Erase everything
accepted <- numeric(p) # Vector of accepted values
batch <- 0
}
# Increment the batch
batch <- batch + 1
for (j in 1:p) {
beta_new[j] <- beta[j] + rnorm(1, 0, exp(epsilon[j]))
alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
if (runif(1) < alpha) {
beta[j] <- beta_new[j] # Accept the value
accepted[j] <- accepted[j] + 1
} else {
beta_new[j] <- beta[j] # Restore the current value after a rejection
}
}
# Store the values after the burn-in period
if (r > burn_in) {
out[r - burn_in, ] <- beta
}
}
out
}

set.seed(123)
# Running the MCMC
fit_MCMC <- as.mcmc(RMH_Gibbs_Adaptive(R = R, burn_in = burn_in, y, X)) # Convert the matrix into a "coda" object
# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
var1 var2 var3 var4 var5 var6 var7 var8
973.6199 599.0175 787.8033 1096.0456 671.1191 597.0799 906.0121 585.4803
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
var1 var2 var3 var4 var5 var6 var7 var8
30.81285 50.08201 38.08057 27.37112 44.70145 50.24453 33.11214 51.23998
1 - rejectionRate(fit_MCMC) # Acceptance rate
var1 var2 var3 var4 var5 var6 var7 var8
0.4472149 0.4471816 0.4511817 0.4503150 0.4485483 0.4493483 0.4490150 0.4490816
# Traceplot of the intercept
plot(fit_MCMC[, 1:2])